-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Interpret data descriptors when reading zip file from (read, nonseek) stream #197
base: master
Are you sure you want to change the base?
Conversation
…reader: &mut S) that reads the next zipfile entry from a stream by potential parsing the data descriptor This is an alternative method to read a zip file. If possible, use the ZipArchive functions as some information will be missing when reading this manner. This method extends the existing read_zipfile_from_stream method when the stream is seekable. This method is superior to ZipArchive in the special case that the underlying stream implementation must buffer all seen/read data to provide seek back support and the memory consumption must be kept small. This could be the case when reading a ZipFile B.zip nested within a zip file A.zip, that is stored on disk. Since A is seekable when stored as file, A.zip can be read using ZipArchive. Be B a zip file stored in A, then we can read B's content using a decompressing reader. The problem is that the decompressing reader will not be seekable (due to decompression). When one would like to still read contents of B.zip without extracting A to disk, the file B.zip must be buffered in RAM. When using ZipArchive to read B.zip from RAM, the whole B.zip file must be buffered to RAM because the central directory of B.zip is located at the end of the file B.zip. This method will read B.zip from the start of the file and returning the first file entry found in B.zip. After the execution of this function and the ZipFile return value is dropped, the reader will be positioned at the end of the file entry. Since this function will never seek back to before the initial position of the stream when the function was called, the underlying stream implementation may discard, after dropping ZipFile, all buffered data before the current position of the stream. Summarizing: In given scenario, this method must not buffer the whole B.zip file to RAM, but only the first file entry.
…sulate potential unsafe data
…plate T: Read in internal data structures Added UntrustedValue and MaybeUntrusted data types
# Conflicts: # src/read.rs
Before: The CRC32 checksum had to be supplied before the stream is read. Though checked when EOF occurred. Now: The CRC32 checksum can be supplied before starting to read or after finishing reading from the stream.
The changes kind of exploded 🙈 😅 - but now it finally works. Looking forward to a review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decompressing readers only exist when the feature flag for their compression method is enabled. I may have more comments later.
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems mostly functionally correct, but I have some performance concerns.
} | ||
|
||
let limit_reader = (reader as &'a mut dyn Read).take(result.compressed_size); | ||
fn read_a_byte(&mut self) -> io::Result<Option<u8>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading one byte at a time like this won't be efficient. Instead, read a block into the look-ahead buffer and use memchr::memmem::find
to check for a data descriptor signature. (Be sure to leave size_of::<Magic>()
bytes in the buffer, in case the descriptor signature straddles two blocks!)
src/read.rs
Outdated
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer) | ||
== spec::Magic::DATA_DESCRIPTOR_SIGNATURE | ||
{ | ||
// potentially found a data descriptor | ||
// check if size matches | ||
let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) { | ||
Ok(data_descriptor) => data_descriptor, | ||
Err(_) => return None, | ||
}; | ||
let data_descriptor_size = data_descriptor.compressed_size; | ||
|
||
if data_descriptor_size == self.number_read_total_actual as u32 { | ||
return Some(data_descriptor); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interpret
already checks whether the magic matches.
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer) | |
== spec::Magic::DATA_DESCRIPTOR_SIGNATURE | |
{ | |
// potentially found a data descriptor | |
// check if size matches | |
let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) { | |
Ok(data_descriptor) => data_descriptor, | |
Err(_) => return None, | |
}; | |
let data_descriptor_size = data_descriptor.compressed_size; | |
if data_descriptor_size == self.number_read_total_actual as u32 { | |
return Some(data_descriptor); | |
} | |
} | |
let Ok(data_descriptor) = ZipDataDescriptor::interpret(&self.look_ahead_buffer) else { | |
return None; | |
}; | |
let data_descriptor_size = data_descriptor.compressed_size; | |
if data_descriptor_size == self.number_read_total_actual as u32 { | |
Some(data_descriptor) | |
} else { | |
None | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check the CRC32 here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs some architectural changes. The function above operates on the compressed zip file. The CRC is calculated from the uncompressed data. With the current architecture checking CRC in this function is not possible.
Currently I have no good idea to solve this. Still, the CRC is checked, so if an "data descriptor" occurs in the stream - Which is only the case if the file size in the data descriptor is correct - the CRC is likely wrong. In this case only the first half/or part of the file is returned to the user, but then an error is thrown when the CRC check fails; reaching the stream EOF.
Review todos:
|
…m' into feature-read-from-seekable-stream
let data_descriptor_size = data_descriptor.compressed_size; | ||
|
||
if data_descriptor_size == self.number_read_total_actual as u32 { | ||
// TODO: check CRC32 here as well |
Check notice
Code scanning / devskim
A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note
run cargo fmt & clippy
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
When reading a zipfile from within a zipfile, the inside zipfile stream is not seekable. Given the case that we are dealing with large files, it might be unfeasible to buffer the whole nested file to RAM. It may be better to stream the zip file instead. Relying on the local file header and data descriptors to determine the entries of the zip file. Completely ignoring the central directory at the file end.
Current project state:
My contribution:
Current state:
Security Considerations:
Reading a zip archive that way without knowing the file size in advance may result in parsing inconsistency. Since the file content of a single file may be attacker controlled, we must assume that an attacker may craft a file that contains a sequence looking like a data descriptor. After that an attacker might put the header of a new zip file entry.
Parsing the file in streaming fashion will return two files. Parsing with the central directory information will results in just one file read. This should be regarded as security risk. Still, allowing to read zip files with data descriptors in a streamed fashion is useful for some applications. I therefore introduced the types UntrustedValue and MaybeUnstrusted. See https://github.com/0xCCF4/zip2/blob/f6b5da958028c5cb90d6de7b5b56a378de52fc95/src/result.rs#L110
If a library user would like to use this streaming functionality this security risk must be explicitly accepted.
Related issues:
#162